The complete genomic nucleotide sequence (29.7 kb) of a Hong
Kong severe acute respiratory syndrome (SARS) coronavirus 
(SARS-CoV) strain HK-39 is determined. Phylogenetic analysis 
of the genomic sequence reveals it to be a distinct member of 
the Coronaviridae family. 5' RACE assay confirms the presence 
of at least six subgenomic transcripts all containing the pre- 
dicted intergenic sequences. Five open reading frames (ORFs), 
namely ORF1a, 1b, S, M, and N, are found to be homologous to
other CoV members, and three more unknown ORFs (X1, X2, 
and X3) are unparalleled in all other known CoV species. Optimal
alignment and computer analysis of the homologous ORFs
have predicted the characteristic structural and functional
domains on the putative genes. The overall nucleotide
conservation of the homologous ORFs is low (<5%) compared with
other known CoVs, implying that HK-39 is a newly emergent 
SARS-CoV phylogenetically distant from other known mem- 
bers. SimPlot analysis supports this finding, and also suggests 
that this novel virus is not the product of a recent recombination
event involving any of the known characterized CoVs. Together, these
results confirm that HK-39 is a novel and distinct member of the
Coronaviridae family, with unknown origin. The completion of 
the genomic sequence of the virus will assist in tracing its ori- 
gin. Exp Biol Med 228:866-873, 2003 
cases of SARS have been reported from a total of 27 countries
worldwide. The number of fatalities has reached
391, suggesting that this virus is highly virulent. A novel 
coronavirus (CoV) from respiratory specimens has been iso- 
lated from two SARS patients in Hong Kong (1). Its novelty 
has been confirmed by investigators from the Centers for 
Disease Control and Prevention (CDC), who have also iso- 
lated the CoV from patient samples (2). This CoV has not 
been previously identified either in humans or animals. The 
successful isolation of the novel CoV has not only permitted 
us to make a definitive diagnosis, but has also enabled us to 
complete the genomic sequence of the virus for further 
characterization. 

Coronaviridae is a viral family that infects birds and 
mammals and causes a variety of diseases (3). Six species of 
CoV genomes have been completely sequenced, namely 
murine hepatitis virus (MHV) (4); avian infectious bronchi- 
tis virus (IBV) (5); human CoV 229E (HCV 229E) (6); 
bovine CoV (BCV) (7); transmissible gastroenteritis virus
(TGEV) (8); and porcine epidemic diarrhea virus (PEDV) 
(9). The size of the genome is about 30 kb, of which more 
than two-thirds is occupied by open reading frames (ORF) 
1a, b. This gene contains two large ORFs, ORF 1a and ORF
1b, that are cotranslated by a -1 ribosomal frameshifting
mechanism (10-13). The gene products of both ORFs are 
believed to be processed into a number of functional sub- 
units (14, 15). The putative functional domains in ORF1 
include two to three papain-like domains, one 3C-like pro- 
tease domain, one growth factor/receptor-like domain, one 
polymerase domain, one metal ion-binding domain, and a 
helicase domain. The remaining one-third of the genome 
consists mainly of three structural proteins: a surface-spike 
glycoprotein (S), a transmembrane protein (M), and a nu- 
cleocapsid protein (N). Some CoV genomes (Group II) also 
contain the hemagglutinin-esterase (HE) gene. The spike 
glycoprotein has major functions in virus-host cell mem- 
brane fusion and interaction with host cell surface receptors. 
Membrane protein is responsible for virus budding, but 
other viral proteins may also be involved in organizing the 
virus budding pre-Golgi membranes (16, 17). The nucleocapsid
protein has been found to be the central hydrophilic
basic domain involved in RNA binding (18). Also, it has 
recently been suggested that CoVs contain an internal core 
of helical nucleocapsid, which is composed of both M and 
N proteins, in which the M protein was supposed to be 
found only in the envelope protein of the virus (19). Small 
ORFs are also usually found between these structural genes, 
depending on the virus species (20-22). 

We have extracted the genomic RNA from a tissue 
culture sample of the SARS CoV strain isolated from one of 
the earliest Hong Kong patients with SARS. This strain has 
been given the name HK-39. By using degenerate and
specific primer PCR amplification, coupled with cDNA library
screening, we were able to obtain the 29-kb complete
genomic sequence. In this paper, we report the complete 
nucleotide sequence of a Hong Kong SARS CoV and com- 
pare and analyze its genomic organization and individual 
genes with those of other known SARS CoV species. 


Materials and Methods 

Source of Materials. The initial starting material for 
this study was RNA isolated from fetal rhesus monkey kid- 
ney (FRhK-4) cells infected with a CoV HK-39 isolate from 
one of the earliest patients who died from SARS in Hong 
Kong (1). Total RNA was extracted using SV total RNA 
isolation system (Promega, Madison, WI) according to the 
manufacturer’s instructions. The first cDNA strand was 
reverse-transcribed using Superscript II (Invitrogen, Carls- 
bad, CA) and random primers. The cDNA was then used for 
cDNA library construction and specific amplification of vi- 
ral genomic sequences. 

Construction of cDNA Library. Double-stranded 
DNA linkers were added to the 3’ end of the first-strand 
cDNA for second strand cDNA synthesis (23). The end of 
the double-stranded cDNA was modified by T4 polymerase. 
The processed DNA was then cloned into
pCR2.1 vector (Invitrogen). The clones with the cDNA in- 
sert were screened by PCR and were subjected to direct 
sequencing analysis. 

Amplification of Viral Genomic Sequences. Degenerate
primers covering the whole genome of the CoV
were designed based on the genomic sequences of other
CoVs. With assistance from the results obtained from library
screening, degenerate-primer amplification, primer
walking, and other public sources (e.g., SARS CoV strain 
Tor2, GenBank accession number NC_004718), the gaps in 
the genome were finally closed by specific primer PCR and 
sequencing. 

Sequencing. DNA fragments resulting from PCRs 
from viral genome and library screening were purified and 
directly sequenced by BigDye Terminator Cycle Sequenc- 
ing in an ABI Prism 3100 Genetic Analyzer (Perkin Elmer, 
Norwalk, CT). CoV sequences were confirmed by searching 
in NCBI BLAST-X. 


5'-RACE and 3'-RACE. The 5' ends of the RNA genome
and RNA transcripts were identified by using two different
5'-RACE commercial kits: RNA ligase-mediated 5’-RACE 
(GeneRacer kit; Invitrogen), and end-switching 5’-RACE 
(SMART RACE cDNA Amplification kit; Clontech, Palo 
Alto, CA). One microgram of RNA extracted from virus 
infected cells was used for each 5'-RACE reaction in ac- 
cordance with the manufacturer’s instructions. Specific 
primers located near the 5’ end of each possible gene were 
used for PCR, and nested PCR was carried out if needed. 
For the 3'-RACE, the first-strand cDNA was reverse-transcribed
with a 3'-RACE oligo, which had two primer
(C1 and C2) annealing sites ahead of the oligo dT. A set of
specific primers at different sites of the genome were combined
with the 3'-RACE anchor primers C1 or C2, respectively,
and were used to amplify the possible 3' end of
poly(A)+ mRNA transcripts or the RNA genome. All the
PCR fragments of 5’- and 3’-RACE were subjected to se- 
quencing directly. 

Data Analysis. The 29.7-kb complete genome se- 
quence was assembled from the sequence contigs using 
SeqMan of the Lasergene Package (DNASTAR, Madison, 
WI). The putative ORFs were predicted by EditSeq of the 
Lasergene Package (DNASTAR). The SARS CoV com- 
plete genome sequence was compared with those of other 
known CoV species. Multiple sequence analysis and opti- 
mal alignments were conducted on MegAlign of the Laser- 
gene Package (DNASTAR). Phylogenetic tree construction 
and bootstrap tests were performed using MEGA 2.1 (Ari- 
zona State University, Tempe, AZ). The similarity plots of 
multiple sequence alignments were performed by SimPlot 
(Johns Hopkins University School of Medicine, Baltimore, 
MD). The coiled-coil motif prediction of the S protein was 
performed by COILS (24). Transmembrane domain topol- 
ogy predictions of the proteins were performed using 
TMHMM (CBS, Technical University of Denmark, Copen- 
hagen, Lyngby, Denmark). 
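
The ORF prediction step described above can be illustrated with a minimal sketch. This is not the EditSeq algorithm, whose internals are not described here; it is a simple forward-strand scan for ATG-to-stop reading frames, and the function name `find_orfs` and the `min_codons` threshold are assumptions made for illustration only.

```python
# Illustrative sketch only: a naive forward-strand ORF scan.
# `find_orfs` and `min_codons` are hypothetical names, not part of any
# tool named in the text.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=50):
    """Return sorted (start, end, frame) tuples for ATG..stop ORFs.

    Coordinates are 0-based and end-exclusive; `end` includes the stop
    codon. `min_codons` counts codons from ATG up to, but not including,
    the stop codon. RNA input (U) is converted to cDNA (T).
    """
    seq = seq.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ATG in this frame
    return sorted(orfs)


if __name__ == "__main__":
    toy = "CCAUGAAAUUUUAGGG"  # toy sequence, not viral data
    print(find_orfs(toy, min_codons=3))
```

A real analysis would also scan the reverse complement where relevant and would treat the ORF 1a/1b junction separately, since ORF 1b is reached by frameshifting rather than by an independent start codon.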


Results and Discussion 

Genome Sequence of HK-39. The first genomic 
sequence of HK-39 obtained in this study was a 240-bp 
ORF 1b fragment amplified by porcine reproductive and 
respiratory syndrome virus (PRRSV) specific primers (Fig. 
1A). PRRSV is a positive-stranded RNA virus and belongs 
to the order Nidovirales, the same order as the CoV. Six 
other fragments totaling about 3 kb covering different regions
of the genome were successfully amplified by degenerate
primers (Fig. 1A). Specific primers were designed for
walking to fill up the gaps. These methods, coupled with 
cDNA library screening, enabled more than 30% of the 
genome sequence to be obtained. Work carried out in the 
BCCA Genome Sciences Center (British Columbia Center 
for Disease in Canada) enabled us to uncover the com- 
plete viral genome in 2 days. When the genome sequence 
of HK-39 was compared with the sequence of Tor2, a 
high homology was found. By using the specific primer 
walking (Fig. 1A), the final sequence of HK-39 was
completed. 

Assembly and Analysis of the Genome. In total, 
125 sequence contigs were used in the complete genome 
assembly. The full genome length of SARS CoV HK-39 
was found to be 29,749 bp, including a 264-bp 5’- 
untranslated region (UTR) and a 339-bp 3’-UTR with a poly 
(A),5 tail. The full sequence was submitted to NCBI, and 
accession number AY278491 was assigned. The genomic 
organization of the virus was found to be similar to other 
virus members in the family Coronaviridae. SARS CoV 
HK-39 contains a 21-kb RNA-dependent RNA polymerase 
gene with two subunits (ORF1a and ORF1b) and three
structural proteins, namely a surface-spike glycoprotein, a 
membrane protein, and a nucleocapsid protein. Three un- 
known ORFs (X1, X2, and X3) were also identified and 
confirmed by 5'-RACE (Fig. 1B and C). The length and the
position of the confirmed ORFs are shown in Figure 1A. 

Sequence Alignment and Phylogenetic Analy- 
sis. The Coronaviridae family is classified into three 
groups according to the structural proteins that affect their 
antigenicity (25). Phylogenetic analysis of the whole ge- 
nome and individual ORFs of HK-39 with other known 
CoV species showed that SARS-CoV only shares a very 
low level of homology with the other members of the CoV 
family at the level of nucleotide sequence and forms a sepa- 
rate group (Fig. 2A). The topology of the phylogenetic tree 
was similar in all analyses based on amino acid differences 
(Fig. 2, B-E). Although SARS-CoV showed a higher degree
of amino acid sequence homology to Group 2 species,
its genomic organization more closely resembles that of
Group 1 species (data not shown). Although this controversy
exists, it is clear that SARS-CoV, including the
Tor2 strain (isolated from SARS patient number 2 in To- 
ronto [1]), Urbani strain (GenBank accession number 
AY278741), and CUHK-W1 strain (isolated from a patient 
suffering from SARS in Hong Kong, GenBank accession 
number AY278554 [unpublished data]), does not fall into 
any of the existing phylogenetic groups, and is distant from 
all known human CoVs such as the human 229E and human 
CoV OC43, although they share a common host. SARS 
CoV is a new member of the CoV family, and is very 
distinct from all CoVs characterized hitherto. 
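
The homology comparisons above rest on counting agreement between aligned sequences. The sketch below shows one simple way to compute percent identity over a pairwise alignment, both overall and in sliding windows (the general idea behind a similarity plot). It is not the method implemented in MegAlign, MEGA, or SimPlot, which use their own alignment and distance models; the function names, window size, and step are assumptions for illustration.

```python
# Illustrative sketch only: percent identity on pre-aligned sequences.
# Function names and default window/step values are hypothetical.
def percent_identity(a, b):
    """Percent of compared columns in which both sequences agree.

    Columns where either sequence has a gap ('-') are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def sliding_identity(a, b, window=200, step=20):
    """Return (window_start, percent_identity) pairs along an alignment."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    last_start = max(len(a) - window, 0)
    return [(i, percent_identity(a[i:i + window], b[i:i + window]))
            for i in range(0, last_start + 1, step)]


if __name__ == "__main__":
    # Toy aligned sequences, not viral data.
    print(percent_identity("ACGT-A", "ACGA-A"))
    print(sliding_identity("AAAACCCC", "AAAAGGGG", window=4, step=4))
```

Plotting the second function's output against window position gives a curve in which locally conserved regions stand out as peaks.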

5’- and 3’-UTR of the Genome. The 5’-UTR of the 
genome was characterized by 5’-RACE assay (of ORF 1). 
We obtained the 264 bp upstream to the predicted AUG 
initiation codon of ORF 1, which is identical to the most 
updated version of the Urbani strain. Alignment of the seven 
5'-RACE sequences showed a consensus of 72 bp, which is 
composed of a leader sequence of 61 bp and an intergenic 
sequence at the last 9 bp. The intergenic sequence (IGS) 
5’-UAAACGAAC-3’ was identical for all of its ORFs, ex- 
cept X2 (Fig. 1B). Initiation codons were usually found 
immediately after or a few bases away from the IGS, except 
for ORF1 and M. An 11-codon “mini-ORF” was predicted
31 bp downstream of the IGS and 128 bp upstream of the
initiation codon of AUG at the 5’-UTR, which is similar to 
that of IBV (26). Eighteen specific primers located at dif- 
ferent sites of the genome were used to amplify the 3’ ends 
of the transcripts. Sequencing of the 3’-RACE products 
showed that only the regions at the 3’ end of the genome 
were amplified. The above results support the unique discontinuous
transcription system in CoVs, which generates a
nested set of transcripts with common 3' ends and a common
leader sequence on the 5' ends. The 3'-UTR (the sequence
downstream of the N protein sequence) has been shown to be
crucial in the regulation of transcription in a CoV. SimPlot
analysis of the 3'-UTR showed a remarkably high degree of
similarity with IBV, in contrast to the other
regions of the genome (Fig. 2F). A 32-bp conserved motif
(nucleotides 29590-29621) was found in the 3'-UTR of 
SARS CoV HK-39. Such a motif shares a very high homol- 
ogy with the stem-loop II-like motif (s2m) found in IBV 
(27). Jonassen et al. (27) pointed out that such a motif was
also found in some viruses that are distinct from IBV, such as
some animal astroviruses and picornavirus, and that it may
be the consequence of an RNA transfer event
that occurred between these viruses. Their findings in IBV,
together with the identified putative s2m motif in HK-39, 
imply that these two CoVs are evolutionarily related. 
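
As an illustration of how the intergenic sequence reported above could be located computationally, the sketch below scans a genome string for the IGS 5'-UAAACGAAC-3' (written as cDNA) and reports the distance to the next downstream ATG. It is a hypothetical helper, not the procedure used in this study, where the IGS positions were established by 5'-RACE; the function name `find_igs_sites` is an assumption.

```python
# Illustrative sketch only: locate IGS copies and the next downstream ATG.
# `find_igs_sites` is a hypothetical helper name.
IGS = "TAAACGAAC"  # 5'-UAAACGAAC-3' from the text, written as cDNA


def find_igs_sites(genome, igs=IGS):
    """Return (igs_start, atg_start, gap) for each IGS occurrence.

    Positions are 0-based. `gap` is the number of bases between the end
    of the IGS and the next ATG; `atg_start` and `gap` are None when no
    ATG follows. The next ATG is not necessarily a true initiation codon.
    """
    genome = genome.upper().replace("U", "T")
    sites = []
    pos = genome.find(igs)
    while pos != -1:
        after = pos + len(igs)
        atg = genome.find("ATG", after)
        if atg == -1:
            sites.append((pos, None, None))
        else:
            sites.append((pos, atg, atg - after))
        pos = genome.find(igs, pos + 1)
    return sites


if __name__ == "__main__":
    # Toy sequence, not viral data: two IGS copies with 0-nt and 2-nt gaps.
    toy = "GG" + IGS + "ATGCCC" + IGS + "TTATGA"
    print(find_igs_sites(toy))
```

An exact-match scan of this kind would miss a variant IGS such as the one noted for X2, so a mismatch-tolerant search would be needed to recover it.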

Putative Functional Domains of ORF 1. Se- 
quence alignment on the predicted amino acid sequence of 
ORF 1 revealed the uniqueness of this novel SARS CoV. In 
general, it shows an average of less than 50% similarity with 
any other group of CoVs. In ORF1a, one papain-like (PL)
domain, a 3C-like (3CL) protease, and a growth factor/
receptor-like (GFL) domain were predicted with
reference to that of TGEV (Fig. 3A) (8). The organization is
similar to other members of the virus family (12, 13, 28). 
Two characteristic and remarkably hydrophobic regions lo- 
cated at both sides of the 3CL domain were identified by 
computer predictions. In total, 10 putative 3CL cleavage 
sites were predicted in ORF1ab, and their locations are
shown in Figure 3A. One of the necessary elements for the 
ribosomal frame-shifting mechanism (10, 13), a ribosomal 
slippage site UUAAAC, was also identified at 13392 bp, 
15 bp upstream of the stop codon of ORF1a. Alignment of
the predicted amino acid sequence of ORF1b to known
strains revealed the presence of conserved putative do- 
mains, including RNA polymerase domain (POL), metal 
ion-binding (MIB), and helicase (Hel) domain (Fig. 3A). 
These findings support the conclusion that the SARS CoV is 
a typical member of the CoV family. 
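
The -1 ribosomal frameshift mentioned above can be sketched at the codon level. The code below searches for the slippage motif given in the text (UUAAAC, written as cDNA) and shows how the reading frame changes when the ribosome steps back one nucleotide. The function names and the toy sequence are assumptions for illustration; the sketch does not model the downstream RNA structure that frameshifting also requires.

```python
# Illustrative sketch only: motif search and a codon-level -1 frameshift.
# Function names are hypothetical; the motif is the one given in the text.
SLIPPERY = "TTAAAC"  # UUAAAC from the text, written as cDNA


def find_slippery_sites(seq, motif=SLIPPERY):
    """Return 0-based start positions of every occurrence of the motif."""
    seq = seq.upper().replace("U", "T")
    hits = []
    pos = seq.find(motif)
    while pos != -1:
        hits.append(pos)
        pos = seq.find(motif, pos + 1)
    return hits


def codons_with_minus1_shift(seq, orf_start, shift_at):
    """Split seq into codons, stepping back one nucleotide at shift_at.

    `shift_at` is the 0-based index of the first nucleotide that would be
    read next in the original frame; after the shift, reading resumes at
    shift_at - 1, so the last base of the previous codon is read twice.
    Returns (codons_before_shift, codons_after_shift).
    """
    seq = seq.upper().replace("U", "T")
    if shift_at <= orf_start or (shift_at - orf_start) % 3 != 0:
        raise ValueError("shift_at must be a later codon boundary")
    before = [seq[i:i + 3] for i in range(orf_start, shift_at, 3)]
    after = [seq[i:i + 3] for i in range(shift_at - 1, len(seq) - 2, 3)]
    return before, after


if __name__ == "__main__":
    toy = "AUGUUAAACGGGUAA"  # toy sequence, not viral data
    print(find_slippery_sites(toy))
    print(codons_with_minus1_shift(toy, orf_start=0, shift_at=9))
```

In the toy sequence the unshifted frame would reach the UAA stop codon, whereas the shifted frame reads through it, which is the sense in which ORF 1b is expressed as an extension of ORF 1a.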

Topologies of Structural Proteins. The putative 5'
region of the spike protein of SARS CoV shares good ho- 
mology with the bovine S1 region and contains a possible 
polybasic cleavage site of SLLR at amino acid 667. Two 
coiled-coil structures were predicted in the putative S2 re- 
gion (Fig. 3, B2). Normally, two to three clusters of heptad 
repeats are found in the S2 region in other CoVs. The heptad 
repeats of these coiled-coil regions were identified in other
CoVs (Fig. 3, B1). A conserved transmembrane domain was 
predicted in the C-terminal of the S2 region (Fig. 3, B3). 
Amino acid sequence alignment of the transmembrane do- 
main has shown that this region is highly conserved. All 
coronavirus M proteins have been shown to be triple membrane-
spanning proteins with a Nexo-Cendo configuration (17, 29,
30). TMHMM predicts three transmembrane domains on 
the N-terminal of the M gene of SARS CoV (Fig. 3C). It is 
observed in the alignment of N protein sequence that there 
are three stretches of amino acid residues that are highly 
conserved among the 16 CoV species. Such residues are 
believed to be involved in its structural maintenance and direct 
interaction with RNA in the case of IBV (Fig. 3D) (25). 


Conclusion. We have completed the sequencing of a
novel CoV HK-39. The distinctive molecular genomic and
phylogenetic characteristics of this novel virus seem to war- 
rant its assignment to a new and distinct group IV of the 
Coronaviridae family. 